Whittle index based Q-learning for restless bandits with average reward
Authors
Abstract
A novel reinforcement learning algorithm is introduced for multi-armed restless bandits with average reward, using the paradigms of Q-learning and the Whittle index. Specifically, we leverage the structure of the Whittle index policy to reduce the search space of Q-learning, resulting in major computational gains. A rigorous convergence analysis is provided, supported by numerical experiments. The experiments show excellent empirical performance of the proposed scheme.
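For readers unfamiliar with the term, a Whittle index policy activates, at each time step, the arms whose current states carry the largest index values. The sketch below illustrates only that selection step and is not the algorithm from the paper. The function name `whittle_index_policy`, the table `index_table`, and the numbers in it are hypothetical, and the indices are assumed to be already known or learned.

```python
import numpy as np

def whittle_index_policy(states, index_table, num_active):
    """Activate the num_active arms with the largest current index.

    states:      current state of each arm (length N, integer-coded)
    index_table: array of shape (N, S); index_table[i, s] is the assumed
                 Whittle index of arm i in state s (hypothetical values)
    Returns a 0/1 action vector of length N.
    """
    states = np.asarray(states)
    arms = np.arange(len(states))
    current = index_table[arms, states]  # index of each arm in its current state
    # Stable sort on negated indices: ties go to the lower-numbered arm.
    top = np.argsort(-current, kind="stable")[:num_active]
    actions = np.zeros(len(states), dtype=int)
    actions[top] = 1
    return actions

# Illustrative index values for 3 arms with 2 states each (made up).
index_table = np.array([[0.1, 0.9],
                        [0.4, 0.2],
                        [0.3, 0.8]])
actions = whittle_index_policy([1, 0, 0], index_table, num_active=2)
print(actions)
```

Because each arm is ranked by a single scalar per state, the policy needs only an N-by-S table rather than a value for every joint state of all arms.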
Similar resources
On the Whittle Index for Restless Multi-armed Hidden Markov Bandits
We consider a restless multi-armed bandit in which each arm can be in one of two states. When an arm is sampled, the state of the arm is not available to the sampler. Instead, a binary signal with a known randomness that depends on the state of the arm is available. No signal is available if the arm is not sampled. An arm-dependent reward is accrued from each sampling. In each time step, each a...
On an Index Policy for Restless Bandits
We investigate the optimal allocation of effort to a collection of n projects. The projects are 'restless' in that the state of a project evolves in time, whether or not it is allocated effort. The evolution of the state of each project follows a Markov rule, but transitions and rewards depend on whether or not the project receives effort. The objective is to maximize the expected time-average ...
Index Policies for a Class of Discounted Restless Bandits
The paper concerns a class of discounted restless bandit problems which possess an indexability property. Conservation laws yield an expression for the reward suboptimality of a general policy. These results are utilised to study the closeness to optimality of an index policy for a special class of simple and natural dual speed restless bandits for which indexability is guaranteed. The strong p...
Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret
In this paper we consider the problem of learning the optimal dynamic policy for uncontrolled restless bandit problems. In an uncontrolled restless bandit problem, there is a finite set of arms, each of which when played yields a non-negative reward. There is a player who sequentially selects one of the arms at each time step. The goal of the player is to maximize its undiscounted reward over a...
Model-Based Average Reward Reinforcement Learning
Reinforcement Learning (RL) is the study of programs that improve their performance by receiving rewards and punishments from the environment. Most RL methods optimize the discounted total reward received by an agent, while, in many domains, the natural criterion is to optimize the average reward per time step. In this paper, we introduce a model-based Average-reward Reinforcement Learning meth...
Journal
Journal title: Automatica
Year: 2022
ISSN: 1873-2836, 0005-1098
DOI: https://doi.org/10.1016/j.automatica.2022.110186